Raft's Log Replication Protocol
Learn how log replication works in Raft.
A log comprises one or more log entries, each holding an executable application-specific command. Raft's basic responsibility is to maintain a consistent, identically ordered log across all servers. Let’s see how Raft implements this in its algorithm.
Log replication#
After a leader is chosen in an election, it handles client requests. These requests contain instructions to be executed by the state machines. The leader adds the instruction to its log as a new entry and sends AppendEntries RPCs to all other servers in parallel to replicate the entry. Once the entry has been safely replicated (that is, once a majority of the followers have stored it), the leader applies it to its own state machine and returns the result of the execution to the client.
If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely, even after it has already responded to the client, until all followers store all log entries.
Every record in the log contains an instruction for the state machine and the term number of when the leader received that instruction. The term numbers recorded in the log are used to detect inconsistencies between logs, which we’ll discuss later in this lesson. Each entry in the log is also assigned an integer index indicating its position within the log.
The following illustration shows a sample log organization in a Raft group:
The responsibility of deciding when it is safe to apply a log entry to the state machines lies with the leader. This process is known as committing an entry. The Raft consensus algorithm ensures that committed entries are durable and will eventually be executed by all available state machines. An entry in the log is considered committed once the leader that created it has replicated it on a majority of servers. Committing an entry also commits all preceding entries in the leader’s log, including entries created by earlier leaders. (We’ll elaborate on this point in the next lesson.) The leader keeps track of the highest index that has been committed and includes it in future AppendEntries RPCs, including heartbeats, so that the other servers are eventually informed. When a follower learns that a log entry has been committed, it applies the entry to its local state machine in the order it appears in the log.
// LogEntry represents a single entry in the Raft log.
type LogEntry struct {
	Term int         // term in which the entry was created
	Data interface{} // application-specific command/data
}
func (rf *Raft) logReplication() {
	for {
		rf.mu.Lock()
		if rf.state != Leader {
			rf.mu.Unlock()
			return
		}
		for i := range rf.peers {
			if i == rf.me {
				continue // skip sending log entries to self
			}
			// Send AppendEntries RPC to each follower in parallel.
			go func(peer int) {
				rf.mu.Lock()
				if rf.state != Leader {
					rf.mu.Unlock()
					return
				}
				// Prepare the AppendEntries arguments while holding the
				// lock (rf.log assumes a sentinel entry at index 0).
				prevLogIndex := rf.nextIndex[peer] - 1
				prevLogTerm := 0
				if prevLogIndex > 0 {
					prevLogTerm = rf.log[prevLogIndex].Term
				}
				entries := make([]LogEntry, len(rf.log[rf.nextIndex[peer]:]))
				copy(entries, rf.log[rf.nextIndex[peer]:])
				args := AppendEntriesArgs{
					Term:         rf.currentTerm,
					LeaderID:     rf.me,
					PrevLogIndex: prevLogIndex,
					PrevLogTerm:  prevLogTerm,
					Entries:      entries,
					LeaderCommit: rf.commitIndex,
				}
				rf.mu.Unlock()

				// Send the AppendEntries RPC without holding the lock.
				var reply AppendEntriesReply
				if !rf.sendAppendEntries(peer, &args, &reply) {
					return // RPC failed; retry on the next iteration
				}

				rf.mu.Lock()
				defer rf.mu.Unlock()
				if reply.Success {
					// Update nextIndex and matchIndex for this follower.
					rf.nextIndex[peer] = args.PrevLogIndex + len(args.Entries) + 1
					rf.matchIndex[peer] = rf.nextIndex[peer] - 1
					// Advance commitIndex if a majority has replicated an
					// entry from the current term.
					for n := rf.commitIndex + 1; n < len(rf.log); n++ {
						if rf.log[n].Term != rf.currentTerm {
							continue
						}
						count := 1 // the leader itself
						for i := range rf.peers {
							if i != rf.me && rf.matchIndex[i] >= n {
								count++
							}
						}
						if count > len(rf.peers)/2 {
							rf.commitIndex = n
						}
					}
				} else if reply.Term > rf.currentTerm {
					// A newer term exists; step down to follower.
					rf.currentTerm = reply.Term
					rf.state = Follower
					rf.votedFor = -1
					rf.persist()
				} else if rf.nextIndex[peer] > 1 {
					// Consistency check failed; back up and retry next round.
					rf.nextIndex[peer]--
				}
			}(i)
		}
		rf.mu.Unlock()
		// Sleep to prevent busy looping.
		time.Sleep(50 * time.Millisecond)
	}
}
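The commit rule embedded in the loop above can be factored out as a pure function over matchIndex. The following is a minimal sketch under the same assumptions (the leader counts itself with its own last log index); the function name is ours, not part of the lesson's code:

```go
package main

import (
	"fmt"
	"sort"
)

// majorityCommitIndex returns the highest log index that is stored on a
// majority of servers. matchIndex[i] is the highest index known to be
// replicated on server i, with the leader included at its own last index.
// Note that Raft additionally requires the entry at that index to carry
// the leader's current term before it may be committed.
func majorityCommitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// The value at position len/2 is replicated on at least a majority.
	return sorted[len(sorted)/2]
}

func main() {
	// 5 servers: leader at index 7, followers at 7, 5, 4, and 2.
	// Index 5 is stored on three of five servers, so it is committable.
	fmt.Println(majorityCommitIndex([]int{7, 7, 5, 4, 2})) // 5
}
```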
The Log Matching Property#
The Raft log mechanism was designed to keep the logs on different servers consistent. This simplifies the system’s behavior, makes it more predictable, and is crucial for ensuring safety. Raft maintains the Log Matching Property, which comprises the following subproperties:
- If two entries with the same index and term appear in different logs, they must contain the same command.
- If two entries in separate logs have the same index and term, all previous entries must be identical.
The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry with the same index and term in its own log, it rejects the new entries. This consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves it whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own up through the new entries.
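The follower-side consistency check described above can be sketched as a small function. We assume 1-based log indices, with prevLogIndex 0 acting as a sentinel for "no preceding entry"; the type and function names are illustrative, not from the lesson's code:

```go
package main

import "fmt"

// Entry holds the term of a single log entry (1-based logical indices).
type Entry struct {
	Term int
}

// consistencyCheckOK reports whether a follower's log passes the
// AppendEntries consistency check: the follower must hold an entry at
// prevLogIndex whose term is prevLogTerm. prevLogIndex == 0 means the
// new entries start at the very beginning of the log, which always matches.
func consistencyCheckOK(log []Entry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true
	}
	if prevLogIndex > len(log) {
		return false // follower's log is too short
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	followerLog := []Entry{{Term: 1}, {Term: 1}, {Term: 2}}
	fmt.Println(consistencyCheckOK(followerLog, 3, 2)) // true
	fmt.Println(consistencyCheckOK(followerLog, 3, 3)) // false: term mismatch
	fmt.Println(consistencyCheckOK(followerLog, 5, 2)) // false: missing entry
}
```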
Inconsistent logs#
In normal operation, the leader and followers maintain synchronized logs, so the AppendEntries consistency check always succeeds. However, if a leader crashes, the logs can become inconsistent because the old leader may not have fully replicated all the entries in its log. These inconsistencies can compound over a series of leader and follower crashes. A follower’s log may be missing entries that exist on the leader, it may contain extra entries that are absent on the leader, or both, and the missing and extraneous entries may span multiple terms.
The following illustration depicts some probable scenarios of inconsistent logs. Each box represents a single log entry with the number corresponding to the term number written inside it:
Consistency check#
To bring a follower’s log into consistency with its own, the leader must find the latest entry on which the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs.
The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log. If a follower’s log is inconsistent with the leader’s, the consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually, nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries succeeds, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.
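A sketch of the follower-side reconciliation step, once the consistency check at prevLogIndex has passed: entries the follower already holds are kept, the first conflicting entry and everything after it are deleted, and the leader's remaining entries are appended. We use 1-based logical indices, and the names are illustrative:

```go
package main

import "fmt"

// Rec holds the term of one log entry.
type Rec struct{ Term int }

// appendAtIndex reconciles a follower's log with entries from the leader,
// assuming the consistency check at prevLogIndex already succeeded.
// Matching entries are left in place; at the first conflict, the follower
// truncates its log and appends the leader's remaining entries.
func appendAtIndex(log []Rec, prevLogIndex int, entries []Rec) []Rec {
	for i := range entries {
		idx := prevLogIndex + i // 0-based slice position of entries[i]
		if idx < len(log) {
			if log[idx].Term == entries[i].Term {
				continue // already have this entry
			}
			log = log[:idx] // conflict: delete this entry and all that follow
		}
		log = append(log, entries[i:]...)
		break
	}
	return log
}

func main() {
	follower := []Rec{{1}, {1}, {2}, {3}} // terms 1, 1, 2, 3
	leaderEntries := []Rec{{2}, {4}}      // leader's entries at indices 3 and 4
	// The term-3 entry conflicts and is replaced by the term-4 entry.
	fmt.Println(appendAtIndex(follower, 2, leaderEntries)) // [{1} {1} {2} {4}]
}
```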
Note: Here, we applied the consistency check mechanism to the scenario where the follower’s log was missing multiple entries compared to the leader’s log. We encourage learners to apply the consistency check mechanism to the other scenarios discussed earlier as well.
Point to ponder
Question
How can we optimize the consistency check mechanism?
If needed, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when a follower rejects an AppendEntries request, it can include the term of the conflicting entry and the last index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; only one AppendEntries RPC is then needed for each term with conflicting entries, rather than one RPC per entry.
Referring to the above illustration:
- The leader assumes authority and initializes nextIndex as the next index in its own log, index 11 in the illustration.
- The leader sends the AppendEntries RPC with a consistency check to see whether the follower’s last committed entry, at index 8, matches the corresponding entry in its own log.
- The follower rejects the AppendEntries RPC and replies with the term of the conflicting entry, term 6, and the last index of that term, index 8.
- The leader skips all the conflicting intermediate entries and sets nextIndex to the index just after the value received from the follower.
However, this optimization may not be necessary as failures are rare and inconsistent entries are unlikely to occur frequently in practice.
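One way a leader might use such a rejection can be sketched as a pure function. We assume the follower's reply carries a conflictTerm and a conflictIndex for that term (the names and the exact backup rule are illustrative; implementations vary): if the leader itself has entries from conflictTerm, it retries just past its last entry of that term; otherwise it jumps straight to conflictIndex.

```go
package main

import "fmt"

// fastBackupNextIndex computes the leader's new nextIndex after a rejection
// that reported conflictTerm and conflictIndex. leaderTerms[i] is the term
// of the leader's entry at 1-based index i+1.
func fastBackupNextIndex(leaderTerms []int, conflictTerm, conflictIndex int) int {
	// Scan backward for the leader's last entry of the conflicting term.
	for i := len(leaderTerms) - 1; i >= 0; i-- {
		if leaderTerms[i] == conflictTerm {
			return i + 2 // 1-based index just past that entry
		}
	}
	// Leader has no entries from conflictTerm: skip the follower's whole term.
	return conflictIndex
}

func main() {
	// Leader log terms by 1-based index: [1 1 4 4]; the follower reported a
	// conflict in term 2 starting at index 3, a term the leader never saw.
	fmt.Println(fastBackupNextIndex([]int{1, 1, 4, 4}, 2, 3)) // 3
}
```

The payoff is that a single rejected RPC moves nextIndex across an entire term of conflicting entries instead of backing up one entry at a time.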
Using this mechanism, a leader doesn’t have to take special measures to ensure log consistency upon regaining power. The leader can resume normal operation, and the logs will automatically converge in response to failures during the AppendEntries consistency check. The leader always adheres to the Leader Append-Only Property and never alters or deletes entries in its own log.
Point to ponder
Question
Why is the Leader Append-Only Property important, and what are the potential risks of not following it?
Raft’s Leader Append-Only Property states that a leader only ever appends new entries to its log; it never deletes or overwrites existing entries. This property is crucial for ensuring data consistency and safety across servers.
- Data inconsistency: If a leader deleted or overwrote its previous log entries, those entries would no longer match the copies already replicated on other servers, making the logs inconsistent.
- Data loss: Violating this property can also lead to data loss. If a leader altered or deleted log entries and then crashed before replicating the changes to other servers, the affected data would be lost.
Raft can accept, replicate, and apply new log entries as long as a majority of servers are functioning. In the normal case, without any malfunctions, a new entry can be replicated with a single round of RPCs to a majority of the cluster (one set of RPCs from the leader to the followers and one set of replies). A single slow follower therefore need not impact the system’s performance, since the leader only waits for a majority.
In the next lesson, we’ll learn some additional protocols of the Raft consensus algorithm covering safety, fault tolerance, and availability.